ABSTRACT

Image retrieval is an emerging research area in the field of information retrieval. Due to the immense
amount of data on the WWW, it is very difficult for users to retrieve relevant images. Traditional image retrieval
approaches based on topic similarity alone are no longer sufficient, and content-based image retrieval (CBIR) is
becoming a source of exact and fast retrieval. A variety of techniques have been developed to improve the performance of
CBIR. Data clustering is an unsupervised method for extracting hidden patterns from huge data sets. Large data sets
often have high dimensionality, and achieving both accuracy and efficiency on high-dimensional data sets with an
enormous number of samples is challenging. In this paper, clustering techniques are discussed and analysed.
We also propose a method, HDK, that uses more than one clustering technique to improve the performance of CBIR. The
method combines hierarchical and divide and conquer K-Means clustering with equivalency and compatible
relation concepts to improve the performance of K-Means on high-dimensional data sets. It also uses
features such as color, texture and shape for an accurate and effective retrieval system. This survey gives an introduction to
content-based image retrieval and explores the different types of retrieval methods.

KEYWORDS: CBIR, Image Feature Extraction, Image Analysis, Image Retrieval, Image Similarity, Clustering
Techniques

INTRODUCTION 

Content-based image retrieval (CBIR), also known as query by image content (QBIC) and content-based visual
information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the
problem of searching for digital images in large databases (see the survey [1] for a recent scientific overview of the CBIR
field). Content-based image retrieval is opposed to concept-based approaches.

"Content-based" means that the search analyzes the contents of the image rather than the metadata such as
keywords, tags, or descriptions associated with the image. The term "content" in this context might refer to colors, shapes,
textures, or any other information that can be derived from the image itself. CBIR is desirable because most web-based
image search engines rely purely on metadata, and this produces many irrelevant results. Also, having humans
manually enter keywords for images in a large database can be inefficient and expensive, and may not capture every keyword
that describes the image. Thus a system that can filter images based on their content would provide better indexing and
return more accurate results.

The term "content-based image retrieval" seems to have originated in 1992 when it was used by T. Kato to
describe experiments into automatic retrieval of images from a database, based on the colors and shapes present [2]. Since
then, the term has been used to describe the process of retrieving desired images from a large collection on the basis of
syntactical  image  features.  The  techniques,  tools,  and  algorithms  that  are  used  originate  from  fields  such  as  statistics, 
pattern  recognition,  signal  processing,  and  computer  vision. 

There  is  a  growing  interest  in  CBIR  because  of  the  limitations  inherent  in  metadata-based  systems,  as  well  as  the 
large  range  of  possible  uses  for  efficient  image  retrieval.  Textual  information  about  images  can  be  easily  searched  using 
existing  technology,  but  this  requires  humans  to  manually  describe  each  image  in  the  database.  This  is  impractical  for  very 
large  databases  or  for  images  that  are  generated  automatically,  e.g.  those  from  surveillance  cameras.  It  is  also  possible  to 
miss  images  that  use  different  synonyms  in  their  descriptions.  Systems  based  on  categorizing  images  in  semantic  classes 
like  "cat"  as  a  subclass  of  "animal"  avoid  this  problem  but  still  face  the  same  scaling  issues. 

CBIR  TECHNIQUES 

Many  CBIR  systems  have  been  developed,  but  the  problem  of  retrieving  images  on  the  basis  of  their  pixel  content 
remains  largely  unsolved. 

Query  Techniques 

Different  implementations  of  CBIR  make  use  of  different  types  of  user  queries.  Query  by  example  is  a  query 
technique  that  involves  providing  the  CBIR  system  with  an  example  image  that  it  will  then  base  its  search  upon.  The 
underlying  search  algorithms  may  vary  depending  on  the  application,  but  result  images  should  all  share  common  elements 
with  the  provided  example. 

Options  for  providing  example  images  to  the  system  include: 

•  A  preexisting  image  may  be  supplied  by  the  user  or  chosen  from  a  random  set. 

•  The  user  draws  a  rough  approximation  of  the  image  they  are  looking  for,  for  example  with  blobs  of  color  or 
general  shapes. 

This query technique removes the difficulties that can arise when trying to describe images with words.

Semantic Retrieval

The ideal CBIR system from a user perspective would involve what is referred to as semantic retrieval, where the
user makes a request like "find pictures of dogs" or even "find pictures of Abraham Lincoln". This type of open-ended
task is very difficult for computers to perform: pictures of chihuahuas and Great Danes look very different, and Lincoln
may not always be facing the camera or in the same pose. Current CBIR systems therefore generally make use of
lower-level features like texture, color, and shape, although some systems take advantage of very common higher-level
features like faces. Not every CBIR system is generic. Some systems are designed for a specific domain, e.g. shape
matching can be used for finding parts inside a CAD-CAM database.

Other  Query  Methods 

Other  query  methods  include  browsing  for  example  images,  navigating  customized/hierarchical  categories, 
querying  by  image  region  (rather  than  the  entire  image),  querying  by  multiple  example  images,  querying  by  visual  sketch, 
querying by direct specification of image features, and multimodal queries (e.g. combining touch, voice, etc.).

CBIR  systems  can  also  make  use  of  relevance  feedback,  where  the  user  progressively  refines  the  search  results  by 
marking  images  in  the  results  as  "relevant",  "not  relevant",  or  "neutral"  to  the  search  query,  then  repeating  the  search  with 
the new information.

Content  Comparison  Using  Image  Distance  Measures 

The most common method for comparing two images in content-based image retrieval (typically an example
image and an image from the database) is to use an image distance measure. An image distance measure compares the
similarity of two images in various dimensions such as color, texture, shape, and others. For example, a distance of 0
signifies an exact match with the query with respect to the dimensions that were considered, while a value greater
than 0 indicates a lesser degree of similarity: the larger the distance, the less similar the images. Search results can
then be sorted based on their distance to the queried image. A long list of distance measures can be found in the literature.
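
The ranking step described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes each image has already been reduced to a fixed-length numeric feature vector, it uses Euclidean distance as one possible distance measure, and the file names and feature values are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equal-length feature vectors; 0 means an exact match."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query, database):
    """Return (name, distance) pairs sorted from most to least similar."""
    scored = [(name, euclidean_distance(query, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical feature vectors, e.g. proportions of three dominant colors.
database = {
    "sunset.jpg": [0.8, 0.1, 0.1],
    "forest.jpg": [0.1, 0.8, 0.1],
    "beach.jpg": [0.6, 0.2, 0.2],
}
ranking = rank_by_distance([0.8, 0.1, 0.1], database)
print([name for name, _ in ranking])  # sunset.jpg first (distance 0), forest.jpg last
```

Any other distance measure can be substituted for `euclidean_distance` without changing the ranking logic.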

Color 

Computing  distance  measures  based  on  color  similarity  is  achieved  by  computing  a  color  histogram  for  each 
image  that  identifies  the  proportion  of  pixels  within  an  image  holding  specific  values  (that  humans  express  as  colors). 
Current  research  is  attempting  to  segment  color  proportion  by  region  and  by  spatial  relationship  among  several  color 
regions.  Examining  images  based  on  the  colors  they  contain  is  one  of  the  most  widely  used  techniques  because  it  does  not 
depend  on  image  size  or  orientation.  Color  searches  will  usually  involve  comparing  color  histograms,  though  this  is  not  the 
only  technique  in  practice. 
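
As a rough sketch of the histogram comparison described above, the following code quantizes RGB pixels into a small number of bins and compares two normalized histograms. The bin count, the pixel representation (a flat list of `(r, g, b)` tuples with values 0 to 255), and the use of histogram intersection as the distance are assumptions made for illustration.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Proportion of pixels that fall in each quantized RGB bin."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Clamp so that the value 255 always lands in the last bin.
        ri = min(r // step, bins_per_channel - 1)
        gi = min(g // step, bins_per_channel - 1)
        bi = min(b // step, bins_per_channel - 1)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]

def histogram_distance(h1, h2):
    """1 minus histogram intersection: 0 for identical histograms, 1 for disjoint ones."""
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2))

# Two hypothetical images with the same colors but different size and pixel order.
small = [(255, 0, 0), (0, 0, 255)]
large = [(0, 0, 255), (255, 0, 0), (0, 0, 255), (255, 0, 0)]
green = [(0, 255, 0)]
print(histogram_distance(color_histogram(small), color_histogram(large)))  # 0.0
print(histogram_distance(color_histogram(small), color_histogram(green)))  # 1.0
```

Because the histogram stores proportions rather than counts or positions, the first comparison yields 0 even though the two images differ in size and pixel arrangement.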

Texture 

Texture  measures  look  for  visual  patterns  in  images  and  how  they  are  spatially  defined.  Textures  are  represented 
by  texels  which  are  then  placed  into  a  number  of  sets,  depending  on  how  many  textures  are  detected  in  the  image.  These 
sets  not  only  define  the  texture,  but  also  where  in  the  image  the  texture  is  located. 

Texture  is  a  difficult  concept  to  represent.  The  identification  of  specific  textures  in  an  image  is  achieved  primarily 
by  modeling  texture  as  a  two-dimensional  gray  level  variation.  The  relative  brightness  of  pairs  of  pixels  is  computed  such 
that  degree  of  contrast,  regularity,  coarseness  and  directionality  may  be  estimated  (Tamura,  Mori  &  Yamawaki,  1978). 
However, the difficulty lies in identifying patterns of co-pixel variation and associating them with particular classes of
textures such as silky or rough.
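
To make the idea of pairwise brightness comparison concrete, the sketch below computes a contrast statistic over pairs of pixels at a fixed offset. It is equivalent to the contrast of a normalized gray-level co-occurrence matrix and is a swapped-in illustration, not the Tamura feature definitions cited above. The image is assumed to be a list of rows of gray levels.

```python
def cooccurrence_contrast(image, dx=1, dy=0):
    """Mean squared gray-level difference over pixel pairs offset by (dx, dy)."""
    height = len(image)
    width = len(image[0])
    total = 0
    pairs = 0
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                total += (image[y][x] - image[ny][nx]) ** 2
                pairs += 1
    if pairs == 0:
        raise ValueError("offset is larger than the image")
    return total / pairs

# Hypothetical 2x4 images: a flat patch and vertical black/white stripes.
flat = [[5, 5, 5, 5], [5, 5, 5, 5]]
stripes = [[0, 255, 0, 255], [0, 255, 0, 255]]
print(cooccurrence_contrast(flat))                  # 0.0
print(cooccurrence_contrast(stripes, dx=1, dy=0))   # 65025.0 (horizontal neighbours)
print(cooccurrence_contrast(stripes, dx=0, dy=1))   # 0.0 (vertical neighbours)
```

Comparing the statistic across offsets gives a crude indication of directionality: the striped image shows high contrast horizontally and none vertically.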

Shape 

Shape does not refer to the shape of an image but to the shape of a particular region that is being sought out. Shapes
will often be determined by first applying segmentation or edge detection to an image. Other methods, like [Tushabe and
Wilkinson 2008], use shape filters to identify given shapes in an image. In some cases accurate shape detection will require
human intervention because methods like segmentation are very difficult to completely automate.

THE  RETRIEVAL  BASED  ON  CLUSTERING  TECHNIQUES 

Clustering techniques can be classified into supervised (including semi-supervised) and unsupervised schemes.
The former consists of hierarchical approaches that demand human interaction to generate splitting criteria for clustering.
In unsupervised classification, also called clustering or exploratory data analysis, no labeled data are available. The goal of
clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures,
rather than to provide an accurate characterization of unobserved samples generated from the same probability
distribution. This paper critically reviews and summarizes different clustering techniques.

Log-Based Clustering

Images can be clustered based on the retrieval system logs maintained by an information retrieval process.
Session keys are created and accessed for retrieval, and from these the session clusters are created. Each session cluster
generates a log-based document, from which the similarity of each image pair is retrieved. A log-based vector is then
created for each session cluster based on its log-based document, and the session cluster is replaced with this vector.
Each unaccessed document creates its own vector.

A hybrid matrix is generated with at least one individual document vector and one log-based clustered
vector. Finally, the hybrid matrix is clustered. This technique is difficult to perform in the case of multidimensional images.
To overcome this, hierarchical clustering is adopted.

Hierarchical  Clustering 

Hierarchical clustering (HC) algorithms organize data into a hierarchical structure according to the proximity
matrix. The results of HC are usually depicted by a binary tree or dendrogram, as shown in Figure 1, where A, B, C, D,
E, F, G are objects or clusters. It represents the nested grouping of patterns and the similarity levels at which groupings
change. The root node of the dendrogram represents the whole data set and each leaf node is regarded as a data
object. The intermediate nodes thus describe the extent to which the objects are proximal to each other, and the height of
the dendrogram usually expresses the distance between each pair of objects or clusters, or an object and a cluster. The
ultimate clustering results can be obtained by cutting the dendrogram at different levels. This representation provides very
informative descriptions and visualization of the potential data clustering structures, especially when real hierarchical
relations exist in the data, as in data from evolutionary research on different species of organisms. HC algorithms are
mainly classified as agglomerative methods and divisive methods. Agglomerative clustering starts with N clusters, each
of which includes exactly one object. A series of merge operations is then carried out that finally leads all objects to
the same group. Divisive clustering proceeds in the opposite way. In the beginning, the entire data set belongs to one cluster,
and a procedure successively divides it until all clusters are singleton clusters. For a cluster with N objects, there are
$2^{N-1} - 1$ possible two-subset divisions, which is very expensive in computation. Therefore, divisive clustering is not
commonly used in practice. In recent years, with the requirement for handling large-scale data sets in data mining
and other fields, many new HC techniques have appeared and greatly improved the clustering performance.
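
The agglomerative procedure can be sketched as below. This is a deliberately naive illustration that assumes single linkage (the distance between two clusters is the smallest distance between their members) and Euclidean distance; stopping at `num_clusters` corresponds to cutting the dendrogram at one level. The helper `count_bipartitions` simply evaluates the $2^{N-1} - 1$ count of two-subset divisions mentioned above.

```python
import math

def count_bipartitions(n):
    """Number of ways to split n objects into two non-empty subsets."""
    return 2 ** (n - 1) - 1

def single_linkage(points, num_clusters):
    """Start from singleton clusters and repeatedly merge the closest pair."""
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]
    merges = []  # (merge distance, first cluster, second cluster), i.e. the dendrogram
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical one-dimensional data with two tight pairs and one outlier.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
clusters, merges = single_linkage(points, 3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3], [4]]
print(count_bipartitions(20))               # 524287
```

Even for 20 objects a divisive step would have to consider over half a million candidate splits, which illustrates why agglomerative methods are preferred in practice.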

Retrieval  Dictionary  Based  Clustering 

A rough classification retrieval system is formed by calculating the distance between two
learned patterns; the learned patterns are classified into different clusters, followed by a retrieval stage. The main
drawback of this system is the determination of the distance. To overcome this problem, a retrieval system based on
retrieval dictionary clustering is developed. This method has a retrieval dictionary generation unit that classifies
learned patterns into multiple clusters and creates a retrieval dictionary using the clusters. Here, the image is retrieved
based on the distance between two spheres with different radii. Each radius is a similarity measure between the central
cluster and an input image. An image that is similar to the query image is then retrieved using the retrieval dictionary.

NCut  Algorithm 

The Ncut method attempts to organize nodes into groups so that the within-group similarity is high and/or
the between-group similarity is low. This method has been shown empirically to be relatively robust in image
segmentation. It can be applied recursively to obtain more than two clusters: each time, the
subgraph with the maximum number of nodes is partitioned (with random selection for tie breaking). The process terminates
when the bound on the number of clusters is reached or the Ncut value exceeds some threshold T. The recursive Ncut
partition is essentially a hierarchical divisive clustering process that produces a tree. Nonetheless, the tree organization
may mislead a user because there is no guarantee of any correspondence between the tree and the semantic
structure of images. Furthermore, organizing image clusters into a tree structure will significantly complicate the user
interface.
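
The Ncut criterion for a single two-way split can be illustrated as follows. The sketch uses the standard normalized cut value, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), and finds the best split by brute force, which is feasible only for tiny graphs; practical implementations rely on an eigenvector approximation instead. The similarity matrix is hypothetical.

```python
from itertools import combinations

def ncut_value(weights, group_a):
    """Normalized cut value of splitting the graph into group_a and the rest."""
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b

def best_bipartition(weights):
    """Exhaustively search all two-way splits (illustration only)."""
    n = len(weights)
    best = None
    for size in range(1, n // 2 + 1):
        for group in combinations(range(n), size):
            value = ncut_value(weights, group)
            if best is None or value < best[0]:
                best = (value, set(group))
    return best

# Hypothetical similarities: nodes 0-1 and 2-3 are strongly linked, cross links are weak.
weights = [
    [0.0, 1.0, 0.125, 0.125],
    [1.0, 0.0, 0.125, 0.125],
    [0.125, 0.125, 0.0, 1.0],
    [0.125, 0.125, 1.0, 0.0],
]
value, group = best_bipartition(weights)
print(value, group)  # 0.4 {0, 1}
```

A recursive variant would apply the same split to the largest remaining subgraph until the cluster bound is reached or the Ncut value exceeds the threshold T.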

K  Means  Clustering 

This nonhierarchical method initially selects as many components of the population as the final
required number of clusters, chosen such that the points are mutually farthest apart. Next, it examines each
component in the population and assigns it to one of the clusters
depending on the minimum distance. The centroid's position is recalculated every time a component is added to the cluster,
and this continues until all the components are grouped into the final required number of clusters. The K-means
algorithm is very simple and can be easily implemented in solving many practical problems. It works very well for
compact and hyperspherical clusters. The time complexity of K-means is O(NKd), where N is the number of samples, K the
number of clusters, and d the number of dimensions. Since K and d are usually much less
than N, K-means can be used to cluster large data sets. Parallel techniques for K-means have been developed that can greatly
accelerate the algorithm. Incremental clustering techniques, for example (Bradley et al., 1998), do not require the storage of
the entire data set and can handle it in a one-pattern-at-a-time way. If the pattern displays enough closeness to a
cluster according to some predefined criteria, it is assigned to the cluster. Otherwise, a new cluster is created to represent
the object.
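
A minimal K-means sketch is given below. It assumes the common batch formulation (all points are assigned, then all centroids are recomputed) rather than the point-by-point centroid update described above, and it uses a simple farthest-first rule to pick initial centers that are far apart. Function names and data are illustrative.

```python
import math

def farthest_first_centers(points, k):
    """Pick k initial centers: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_iter=100):
    """Return (centers, labels) after batch K-means."""
    centers = [tuple(c) for c in farthest_first_centers(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: N * K distance computations in d dimensions.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center for an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical 2-D feature vectors forming two compact groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, 2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The assignment step is where the O(NKd) cost per pass arises, which is why the method scales to large N when K and d are small.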

Graph  Theory  Based  Clustering 

The concepts and properties of graph theory make it very convenient to describe clustering problems by means
of graphs. Nodes of a weighted graph correspond to data points in the pattern space, and edges reflect the proximities
between each pair of data points. A graph-based clustering method is particularly well suited to data from which a
minimum spanning tree (MST) can be constructed. It can be used for detecting clusters of any size and shape
without specifying the actual number of clusters. Well-known graph-based clustering algorithms include minimum spanning
tree based clustering, the cluster editing method, and the HCS algorithm. Current research is focused on clustering using
a divide and conquer approach. This clustering methodology is usually used to detect irregular cluster boundaries in clustering
results. Zahn proposes to construct an MST and delete the inconsistent edges, i.e. the edges whose weights are significantly
larger than the average weight of the nearby edges in the tree. The inconsistency measure is applied to each edge to detect
and remove the inconsistent edges, which results in a set of disjoint subtrees, each of which represents a separate
cluster.
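
The MST-based procedure can be sketched as follows. For brevity this is a simplified variant: an edge is treated as inconsistent when its weight exceeds `factor` times the mean weight of all MST edges, rather than the mean of only the nearby edges. The `factor` parameter and the sample data are assumptions for illustration.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((w, i, j))
        in_tree.add(j)
    return edges

def mst_clusters(points, factor=2.0):
    """Remove unusually long MST edges; the remaining subtrees are the clusters."""
    edges = minimum_spanning_tree(points)
    if not edges:
        return [[i] for i in range(len(points))]
    mean_weight = sum(w for w, _, _ in edges) / len(edges)
    kept = [(i, j) for w, i, j in edges if w <= factor * mean_weight]

    parent = list(range(len(points)))  # union-find over point indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for idx in range(len(points)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

# Hypothetical points: two chains separated by a large gap.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
print(mst_clusters(points))  # [[0, 1, 2], [3, 4, 5]]
```

Note that the number of clusters is not specified in advance; it follows from how many edges are judged inconsistent.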

Divide  and  Conquer  K-Means 

When a data set is too large, it is possible to divide the data into different subsets and to apply the
selected clustering algorithm separately to each subset. This approach is known as divide and conquer. The divide and
conquer algorithm first divides the entire data set into subsets based on some criterion. Each selected subset is then
clustered with a clustering algorithm, here K-Means. The advantage is to accelerate the search and to reduce the complexity,
which depends on the number of samples. Methods based on subspace clustering may help to ease the problem of clustering
high-dimensional data, but they are not well suited to obtaining a large number of clusters. A possible solution to this issue
is to cluster hierarchically (obtain a small number of clusters and then cluster again each of the clusters obtained). The
proposed enhanced clustering method, HDK, which uses a combination of unsupervised clustering methods, is one of
the methods that can greatly accelerate the CBIR system.
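
The two-level idea of obtaining a small number of clusters and then clustering each of them again can be sketched as below. This is not the HDK method itself, whose details are not given here; it is a generic illustration that assumes batch K-means with farthest-first initialization at both levels, and the function names and parameters are hypothetical.

```python
import math

def k_means(points, k, max_iter=50):
    """Plain batch K-means with farthest-first initialization; returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centers[j])
        if updated == centers:
            break
        centers = updated
    return labels

def two_level_k_means(points, coarse_k, fine_k):
    """Divide: coarse K-means splits the data into subsets.
    Conquer: K-means is run again inside each subset.
    Returns a (coarse_label, fine_label) pair per point."""
    coarse = k_means(points, coarse_k)
    result = [None] * len(points)
    for c in range(coarse_k):
        indices = [i for i, lab in enumerate(coarse) if lab == c]
        if not indices:
            continue
        subset = [points[i] for i in indices]
        fine = k_means(subset, min(fine_k, len(subset)))
        for i, lab in zip(indices, fine):
            result[i] = (c, lab)
    return result

# Hypothetical data: two distant regions, each containing two small groups.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (100, 0), (100, 1), (105, 0), (105, 1)]
result = two_level_k_means(points, coarse_k=2, fine_k=2)
print(len(set(result)))  # 4
```

Each second-level run only sees the points of one subset, so the per-run cost depends on the subset size rather than on the full data set.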

CONCLUSIONS 

The purpose of this survey is to provide an overview of the functionality of content-based image retrieval
systems. Combining the advantages of HC and the divide and conquer K-Means strategy can help with both efficiency and
quality. The HC algorithm can construct structured clusters. Although HC yields high-quality clusters, its complexity is
quadratic, so it is not suitable for huge data sets and high-dimensional data. In contrast, K-Means is linear in the size of
the data set and the dimension and can be used for big data sets, but it yields lower-quality clusters. Divide and conquer
K-Means can be used for high-dimensional data sets. In this paper we present a method, HDK, that uses the advantages of both
HC and divide and conquer K-Means by introducing equivalency and compatible relation concepts. Using two-step clustering in
high-dimensional data sets, with the number of clusters chosen based on the color feature, helps to improve the accuracy and
efficiency of the original K-Means clustering. For this purpose we should consider orthogonal space. The HDK algorithm has
been used extensively in various areas to improve the performance of the system and to achieve better results in different
applications.